This paper considers the partial monitoring problem with $k$ actions and $d$ outcomes and provides the first best-of-both-worlds algorithm, whose regret is polylogarithmic in the stochastic regime and near-optimal in the adversarial regime. More specifically, we show that for non-degenerate locally observable games, the regret in the stochastic regime is bounded by $O(k^3 m^2 \log(T) \log(k_{\Pi} T) / \Delta_{\min})$ and in the adversarial regime by $O(k^{2/3} m \sqrt{T \log(T) \log k_{\Pi}})$, where $T$ is the number of rounds, $m$ is the maximum number of distinct observations per action, $\Delta_{\min}$ is the minimum suboptimality gap, and $k_{\Pi}$ is the number of Pareto-optimal actions. Moreover, we show that for non-degenerate globally observable games, the regret in the stochastic regime is bounded by $O(\max\{c_{\mathcal{G}}^2 / k, c_{\mathcal{G}}\} \log(T) \log(k_{\Pi} T) / \Delta_{\min}^2)$ and in the adversarial regime by $O((\max\{c_{\mathcal{G}}^2 / k, c_{\mathcal{G}}\} \log(T) \log(k_{\Pi} T))^{1/3} T^{2/3})$, where $c_{\mathcal{G}}$ is a game-dependent constant. Our algorithms are based on the follow-the-regularized-leader framework, taking into account the nature of the partial monitoring problem, and are inspired by algorithms for online learning with feedback graphs.
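For intuition, the follow-the-regularized-leader (FTRL) framework mentioned above selects, at each round, the action distribution minimizing the cumulative estimated loss plus a regularization term. Below is a minimal, hypothetical sketch of one FTRL step with a negative-Shannon-entropy regularizer (which reduces to exponential weights); the regularizer, learning-rate schedule, and partial-monitoring loss estimation used in the paper are more involved.

```python
import numpy as np

def ftrl_step(cum_loss_est, eta):
    """One FTRL step over the probability simplex.

    With the negative Shannon entropy as regularizer, the minimizer of
    <cum_loss_est, p> + (1/eta) * sum_a p_a log p_a over the simplex
    is the exponential-weights distribution computed below.
    """
    logits = -eta * np.asarray(cum_loss_est, dtype=float)
    logits -= logits.max()  # shift for numerical stability
    w = np.exp(logits)
    return w / w.sum()

# Example: three actions with cumulative loss estimates [2.0, 1.0, 3.0].
p = ftrl_step([2.0, 1.0, 3.0], eta=0.5)  # favors the second action
```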
This paper considers the multi-armed bandit (MAB) problem and provides a new best-of-both-worlds (BOBW) algorithm that works nearly optimally in both stochastic and adversarial settings. In the stochastic setting, some existing BOBW algorithms achieve tight gap-dependent regret bounds of $O(\sum_{i: \Delta_i > 0} \frac{\log T}{\Delta_i})$ for suboptimality gap $\Delta_i$ of arm $i$ and time horizon $T$. As shown by Audibert et al. [2007], however, the performance can be improved in stochastic environments with arms of low variance; indeed, they provide a stochastic MAB algorithm with gap-variance-dependent regret bounds of $O(\sum_{i: \Delta_i > 0} (\frac{\sigma_i^2}{\Delta_i} + 1) \log T)$ for loss variance $\sigma_i^2$ of arm $i$. In this paper, we propose the first BOBW algorithm with gap-variance-dependent bounds, showing that this variance information can be exploited even in possibly adversarial environments. Further, the leading constant factor in our gap-variance-dependent bound is only (nearly) twice the lower-bound value. In addition, the proposed algorithm enjoys several data-dependent regret bounds in adversarial environments and works well in stochastic settings with adversarial corruptions. The proposed algorithm is based on the follow-the-regularized-leader approach and employs adaptive learning rates that depend on the empirical prediction error of the loss, which leads to gap-variance-dependent regret bounds reflecting the variances of the arms.
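As a rough illustration of the adaptive learning rate mentioned in the last sentence, the following hypothetical schedule shrinks the learning rate with the cumulative squared prediction error of the losses, so that predictable (low-variance) losses keep the learning rate large; the constants and the exact error terms in the paper differ.

```python
import numpy as np

def adaptive_eta(pred_errors, c=1.0):
    """Hypothetical learning-rate schedule for optimistic FTRL.

    pred_errors holds (estimated loss - predicted loss) for past rounds;
    small errors, i.e., low-variance arms, yield a larger learning rate
    and hence a smaller regret contribution.
    """
    return c / np.sqrt(1.0 + np.sum(np.square(pred_errors)))
```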
We consider the fixed-budget best arm identification problem where the goal is to find the arm of the largest mean with a fixed number of samples. It is known that the probability of misidentifying the best arm is exponentially small in the number of rounds. However, only limited characterizations of the rate (exponent) of this value have been discussed. In this paper, we characterize the optimal rate as a result of a global optimization over all possible parameters. We introduce two rates, $R^{\mathrm{go}}$ and $R^{\mathrm{go}}_{\infty}$, corresponding to lower bounds on the misidentification probability, each of which is associated with a proposed algorithm. The rate $R^{\mathrm{go}}$ is associated with $R^{\mathrm{go}}$-tracking, which can be efficiently implemented with a neural network and is shown to outperform existing algorithms. However, a non-trivial condition is required for this rate to be achievable. To deal with this issue, we introduce the second rate, $R^{\mathrm{go}}_{\infty}$. We show that this rate is indeed achievable by introducing a conceptual algorithm called delayed optimal tracking (DOT).
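To make the notion of "rate" concrete: in large-deviation terms (our schematic notation, not the paper's exact definitions), a rate $R$ characterizes the exponential decay of the misidentification probability, as sketched below.

```latex
% Schematic definition of the misidentification rate (illustrative notation):
% \hat{a}_T is the arm returned after the budget of T samples, a^* the best arm.
\[
  \mathbb{P}\bigl(\hat{a}_T \neq a^*\bigr) = \exp\bigl(-R\,T + o(T)\bigr),
  \qquad\text{i.e.,}\qquad
  R = \lim_{T \to \infty} -\frac{1}{T} \log \mathbb{P}\bigl(\hat{a}_T \neq a^*\bigr).
\]
```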
We study the survival bandit problem, a variant of the multi-armed bandit problem introduced in an open problem by Perotto et al. (2019), in which there is a constraint on the cumulative reward: at each time step, the agent receives a (possibly negative) reward, and if the cumulative reward falls below a pre-specified threshold, the procedure stops; this phenomenon is called ruin. This is the first paper to study a framework in which ruin may, but need not, occur. We first argue that, under a naive definition of regret, uniform regret is unachievable. Next, we provide tight lower bounds on the probability of ruin (together with matching policies). Based on this lower bound, we define the survival regret as the objective to minimize and provide a policy achieving uniform survival regret (at least in the case of integral rewards) when the time horizon $T$ is known.
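The interaction protocol described above can be summarized in a few lines of hypothetical code; the names and the stopping rule below illustrate the setting, not the paper's proposed policy.

```python
import numpy as np

def run_survival_bandit(policy, arms, threshold, horizon, seed=0):
    """Play until ruin (cumulative reward below `threshold`) or `horizon` ends.

    `arms[a](rng)` returns a possibly negative reward for arm a;
    `policy(t, cumulative)` picks an arm index given the current state.
    """
    rng = np.random.default_rng(seed)
    cumulative = 0.0
    for t in range(horizon):
        cumulative += arms[policy(t, cumulative)](rng)
        if cumulative < threshold:
            return t + 1, cumulative, True   # ruin occurred
    return horizon, cumulative, False        # survived the whole horizon

# Example: two +/-1-type arms; always play arm 0 until ruin or T rounds.
arms = [lambda r: r.choice([-1.0, 1.0]), lambda r: r.choice([-1.0, 1.1])]
print(run_survival_bandit(lambda t, c: 0, arms, threshold=-10.0, horizon=1000))
```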
This study considers online learning with general directed feedback graphs. For this problem, we present best-of-both-worlds algorithms that achieve nearly tight regret bounds for adversarial environments as well as poly-logarithmic regret bounds for stochastic environments. As Alon et al. [2015] have shown, tight regret bounds depend on the structure of the feedback graph: strongly observable graphs yield minimax regret of $\tilde{\Theta}(\alpha^{1/2} T^{1/2})$, while weakly observable graphs induce minimax regret of $\tilde{\Theta}(\delta^{1/3} T^{2/3})$, where $\alpha$ and $\delta$, respectively, represent the independence number of the graph and the domination number of a certain portion of the graph. Our proposed algorithm for strongly observable graphs has a regret bound of $\tilde{O}(\alpha^{1/2} T^{1/2})$ for adversarial environments, as well as of $O(\frac{\alpha (\ln T)^3}{\Delta_{\min}})$ for stochastic environments, where $\Delta_{\min}$ expresses the minimum suboptimality gap. This result resolves an open question raised by Erez and Koren [2021]. We also provide an algorithm for weakly observable graphs that achieves a regret bound of $\tilde{O}(\delta^{1/3} T^{2/3})$ for adversarial environments and poly-logarithmic regret for stochastic environments. The proposed algorithms are based on the follow-the-regularized-leader approach combined with newly designed update rules for learning rates.
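A standard building block in this line of work, sketched below in our own hypothetical notation (not the paper's exact estimator), is the importance-weighted loss estimate under graph feedback: playing an action reveals the losses of all its out-neighbors, and each observed loss is divided by the probability of observing it, which makes the estimate unbiased.

```python
import numpy as np

def graph_loss_estimates(played, observed_losses, p, out_neighbors):
    """Importance-weighted loss estimates under directed graph feedback.

    Playing action j reveals the losses of every i in out_neighbors[j], so
    action i is observed with probability
    P_i = sum_{j : i in out_neighbors[j]} p_j.
    """
    k = len(p)
    P = np.zeros(k)
    for j in range(k):
        for i in out_neighbors[j]:
            P[i] += p[j]
    est = np.zeros(k)
    for i in out_neighbors[played]:
        est[i] = observed_losses[i] / P[i]
    return est
```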
Ordinary supervised learning is useful when we have paired training data of an input $x$ and an output $y$. However, such paired data can be difficult to collect in practice. In this paper, we consider the task of predicting $y$ when we have no paired data, but instead have two separate, independent datasets of $x$ and of $y$, each equipped with an additional auxiliary variable $u$; that is, we have two datasets $S_X = \{(x_i, u_i)\}$ and $S_Y = \{(u'_j, y'_j)\}$. A naive approach is to predict $u$ from $x$ using $S_X$ and then $y$ from $u$ using $S_Y$, but we show that this is not statistically consistent. Moreover, predicting $u$ can be more difficult than predicting $y$ in practice, e.g., when $u$ has higher dimensionality. To circumvent this difficulty, we propose a new method that avoids predicting $u$ and instead learns $y = f(x)$ directly by training $f(x)$ on $S_{X}$ to predict $h(u)$, where $h(u)$ is trained on $S_{Y}$ to approximate $y$. We prove statistical consistency and error bounds for our method and experimentally confirm its practical usefulness.
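A minimal sketch of the two-step idea, using a linear least-squares instantiation on synthetic data (the names and the data-generating process below are our own illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic unpaired data sharing an auxiliary variable u (illustrative only).
n, d_x, d_u = 1000, 5, 8
A = rng.normal(size=(d_x, d_u))                  # mechanism x -> u
w = rng.normal(size=d_u)                         # mechanism u -> y
x = rng.normal(size=(n, d_x))
u_x = x @ A + 0.1 * rng.normal(size=(n, d_u))    # S_X = {(x_i, u_i)}
u_y = rng.normal(size=(n, d_u))
y = u_y @ w + 0.1 * rng.normal(size=n)           # S_Y = {(u'_j, y'_j)}

def fit_linear(Z, t):
    """Least-squares fit; returns a predictor mapping Z-like inputs to t."""
    W, *_ = np.linalg.lstsq(Z, t, rcond=None)
    return lambda Q: Q @ W

h = fit_linear(u_y, y)       # step 1: train h(u) on S_Y to approximate y
f = fit_linear(x, h(u_x))    # step 2: train f(x) on S_X to predict h(u)

# f now predicts y from x although no (x, y) pair was ever observed.
```

Note that $f$ never needs to reconstruct $u$ itself, which is the point of the method when $u$ is high-dimensional.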
This document presents endeavors to represent emotion in a computational cognitive architecture. The first part introduces research organized around two axes of emotional affect: pleasantness and arousal. Building on these basic emotional components, the document then discusses emergent properties of emotion, presenting interaction studies with human users. Drawing on the author's past studies, the document concludes that the advantage of the cognitive human-agent interaction approach lies in representing human internal states and processes.
Discriminativeness is a desirable feature of image captions: captions should describe the characteristic details of input images. However, recent high-performing captioning models, which are trained with reinforcement learning (RL), tend to generate overly generic captions despite their high performance in various other criteria. First, we investigate the cause of the unexpectedly low discriminativeness and show that RL has a deeply rooted side effect of limiting the output words to high-frequency words. The limited vocabulary is a severe bottleneck for discriminativeness as it is difficult for a model to describe the details beyond its vocabulary. Then, based on this identification of the bottleneck, we drastically recast discriminative image captioning as a much simpler task of encouraging low-frequency word generation. Hinted by long-tail classification and debiasing methods, we propose methods that easily switch off-the-shelf RL models to discriminativeness-aware models with only a single-epoch fine-tuning on the part of the parameters. Extensive experiments demonstrate that our methods significantly enhance the discriminativeness of off-the-shelf RL models and even outperform previous discriminativeness-aware methods with much smaller computational costs. Detailed analysis and human evaluation also verify that our methods boost the discriminativeness without sacrificing the overall quality of captions.
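As one concrete, hypothetical instantiation of the reweighting idea borrowed from long-tail classification, a fine-tuning loss can up-weight low-frequency target words; the exponent and normalization below are illustrative, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def frequency_weighted_loss(logits, targets, word_freq, alpha=0.5):
    """Cross-entropy that up-weights rare vocabulary items.

    logits: (batch, vocab) scores; targets: (batch,) word indices;
    word_freq: (vocab,) corpus frequencies. Weights ~ freq^(-alpha)
    encourage the low-frequency words needed for discriminative captions.
    """
    weights = word_freq.clamp(min=1.0).pow(-alpha)
    weights = weights / weights.mean()   # keep the loss scale comparable
    return F.cross_entropy(logits, targets, weight=weights)
```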
Our team, Hibikino-Musashi@Home (HMA), was founded in 2010 and is based in the Kitakyushu Science and Research Park, Japan. We have participated in the open platform league of the RoboCup@Home Japan Open competition every year since 2010, and we also participated in RoboCup 2017 Nagoya as both an open platform league and a domestic standard platform league team. Currently, the Hibikino-Musashi@Home team has 20 members from seven different laboratories based in the Kyushu Institute of Technology. In this paper, we introduce our team's activities and the technologies we use.
In this paper, we report a field study in which we deployed two service robots in a bakery as a sales promotion. Previous studies have explored public applications of service robots, such as in shopping malls. However, more evidence is needed that service robots can contribute to sales in real stores. Moreover, the behaviors of customers and service robots in the context of sales promotion have not been well examined. Therefore, the types of robot behaviors that can be considered effective, and how customers react to these robots, remain unclear. To address these questions, we installed two teleoperated service robots in a bakery for nearly two weeks, one near the entrance serving as a greeter and the other inside the store recommending products. The results show that sales increased dramatically when the robots were deployed. Furthermore, we annotated video recordings of the robots' and customers' behaviors. We found that although the robot placed at the entrance successfully attracted the interest of passersby, no significant increase in the number of customers visiting the store was observed. However, we confirmed that the recommendations of the robot operating inside the store did have a positive impact. We discuss our findings in detail and provide theoretical and practical recommendations for future research and applications.